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We describe how to apply the recently developed pole expansion plus selected inversion (PEpSI) 
technique to Kohn-Sham density function theory (DFT) electronic structure calculations that are 
based on atomic orbital discretization. We give analytic expressions for evaluating charge density, 
total energy, Helmholtz free energy and atomic forces without using the eigenvalues and eigenvectors 
of the Kohn-Sham Hamiltonian. We also show how to update the chemical potential without using 
Kohn-Sham eigenvalues. The advantage of using PEpSI is that it has a much lower computational 
complexity than that associated with the matrix diagonalization procedure. We demonstrate the 
performance gain by comparing the timing of PEpSI with that of diagonalization on insulating and 
metallic nanotubes. For these quasi- ID systems, the complexity of PEpSI is linear with respect to 
the number of atoms. This linear scaling can be observed in our computational experiments when 
the number of atoms in a nanotube is larger than a few hundreds. Both the wall clock time and the 
memory requirement of PEpSI is modest. This makes it even possible to perform Kohn-Sham DFT 
calculations for 10, 000- atom nanotubes on a single processor. We also show that the use of PEpSI 
does not lead to loss of accuracy required in a practical DFT calculation. 

PACS numbers: 71.15.Dx, 71.15.Ap 



I. INTRODUCTION 

Electronic structure calculations based on solving the 
Kohn-Sham equations play an important role in the anal- 
ysis of electronic, structural and optical properties of 
molecules, solids and other nano structures. The effi- 
ciency of such a calculation depends largely on the com- 
putational cost associated with the evaluation of the elec- 
tron charge density for a given potential within a self- 
consistent field (SCF) iteration. The most straightfor- 
ward way to perform such an evaluation is to partially 
diagonalize the Kohn-Sham Hamiltonian by computing 
a set of eigenvectors corresponding to the algebraically 
smallest eigenvalues of the Hamiltonian. The complex- 
ity of this approach is 0{N^)^ where is the number 
of electrons in the atomistic system of interest. As the 
number of atoms or electrons in the system increases, the 
cost of diagonalization becomes prohibitively expensive. 

An alternative scheme for evaluating the charge den- 
sity is to express the charge density as the diagonal of the 
Fermi-Dirac function evaluated at a fixed Kohn-Sham 
Hamiltonian. By approximating the Fermi-Dirac func- 
tion through a pole expansion techniqu^, we can reduce 
the problem of computing the charge density to that of 
computing the diagonal of the inverses of a number of 
shifted Kohn-Sham Hamiltonians. This approach was 
pursued by a number of researchers in the past. The 
cost of this approach depends on the number of poles re- 
quired to expand the Fermi-Dirac function and the cost 
for computing the diagonal of the inverse of a shifted 
Kohn-Sham Hamiltonian. 

The recent work by Lin et al^ provides an accurate 
and efficient pole-expansion scheme for approximating 
the Fermi-Dirac function. The number of poles required 



in this approach is proportional to log(/3A£^), where P is 
proportional to the inverse of the temperature factor, and 
AE is the spectral width of the Kohn-Sham Hamiltonian. 
(i.e. the difference between the largest and the smallest 
eigenvalues). This pole count is significantly lower than 
the previous approaches^ ^. 

Furthermore, an efficient selected inversion algorithm 
for computing the inverse of the diagonal of a shifted 
Kohn-Sham Hamiltonian without computing the full in- 
verse of the Hamiltonian has been developed The 
complexity of this algorithm is 0{Ne) for quasi- ID 
systems such as nanorods, nanotubes and nanowires, 
0{Ne^'^) for quasi-2D systems such as graphene and sur- 
faces, and 0{N^) for 3D bulk systems. In exact arith- 
metic, the selected inversion algorithm gives the exact 
diagonal of the inverse, i.e., the algorithm does not rely 
on any type of localization or truncation scheme. For in- 
sulating systems, the use of localization and truncation 
can be combined with selected inversion to reduce the 
complexity of the algorithm further to 0{Ne) even for 
general 3D systems. 

In the previous worl^^^we used the pole expansion plus 
selected inversion (PEpSI) technique to solve the Kohn- 
Sham problem discretized by a finite difference scheme. 
However, it is worth pointing out that PEpSI is a gen- 
eral technique that is not limited to discretized problems 
obtained from finite difference. In particular, it can be 
readily applied to discretized Kohn-Sham problems ob- 
tained from any localized basis expansion technique. In 
this paper, we describe how PEpSI can be used to speed 
up solving a discretized Kohn-Sham problem obtained 
from an atomic orbital basis expansion. We show that 
electron charge density, total energy, Helmholtz free en- 
ergy and atomic forces can all be efficiently calculated by 
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using PEpSL 

We demonstrate the performance gain we can achieve 
by comparing PEpSI with the LAPACK diagonahzation 
subroutine dsygv on two types of nanotubes. We show 
that by using the PEpSI technique, it is possible to per- 
form electronic structure calculations accurately for a 
nanotube that contains 10,000 atoms on a single proces- 
sor within a reasonable amount of time. The crossover 
point beyond which the computational complexity of 
PEpSI exhibits linear scaling with respect to the num- 
ber of atoms is around a few hundred atoms. 

This paper is organized as follows. In section [TTj we 
show how the PEpSI technique previously developed^ ^ 
can be extended to discretized Kohn-Sham problems ob- 
tained from an atomic orbital expansion scheme. In par- 
ticular, we will show how charge density, total energy, 
free energy and force can be calculated in this formal- 
ism. We will also discuss how to update the chemical 
potential. In section |III[ we report the performance of 
PEpSI on two quasi- ID test problems. 

Throughout the paper, we use 3m{A) to denote the 
imaginary part of a complex matrix A. A properly de- 
fined inner product between two functions / and g is 
sometimes denoted by {f\g). The diagonal of a matrix A 
is sometimes denoted by diag(74). 



II. THEORY 

The ground-state electron charge density p{x) of an 
atomistic system can be obtained from the self-consistent 
solution to the Kohn-Sham equations 



(1) 



where H is the Kohn-Sham Hamiltonian that depends on 
{tlji{x)} are the Kohn-Sham orbitals that satisfy the 
orthonormality constraints J ip^ {x)iljj{x)dx = Sij^ and 
the eigenvalue Si is often known as the ith Kohn-Sham 
quasi-particle energy. Using the Kohn-Sham orbitals, we 
can define the charge density by 



p(x) = £iv,(x)iv,. 



1, 2, oo, 



(2) 



with occupation numbers < < 2, z = l,2,...oo. At 
finite temperature T = the occupation num- 

bers in ([2| can be chosen according to the Fermi-Dirac 
distribution function 



fi = f^{ei- p) 



I ^ g/3(£i-/i) • 



(3) 



and jii is the chemical potential chosen to ensure that 

j p{x)dx = Ne. (4) 



Note that p{x) is simply the diagonal of the single par- 
ticle density matrix defined by 



^{x,x') = '^ilJi{x)ff3{ei - p)iIj-{x'), 



(5) 



i=l 



and the charge sum rule in Q can be expressed alterna- 
tively by 



Tr[7(x,xO] =iVe, 



(6) 



where Tr denotes the trace of an operator. 

It follows from ([T]) and ([5| that the electron density 
p{x) is a fixed point of the Kohn-Sham map defined by 

p{x) = diag (^f0{H[p{x)] - fiS{x, x'))) , (7) 

where p is chosen to satisfy ([6|. The most widely used 
algorithm for finding the solution to (|6| and ([7|) is a Broy- 
den type of quasi-Newton algorithm. In the physics liter- 
ature, this is often referred to as the self-consistent field 
(SCF) iteration. The most time consuming part of this 
algorithm is the evaluation of p{x) = 7(0;, x) in 



Basis expansion by nonorthogonal basis 
functions 



An infinite-dimensional Kohn-Sham problem can be 
discretized in a number of ways (e.g., planewave expan- 
sion, finite difference, finite element etc.) In this paper, 
we focus on a discretization scheme in which a Kohn- 
Sham orbital tjji is expanded by a linear combination of 
a finite number of basis functions {^pj}^ i.e.. 



N 



(8) 



We should note that the total number of basis func- 
tions N is generally proportional to the number of elec- 
trons A^e or atoms in the system to be studied. These 
basis functions {^j} can be constructed to have local 
nonzero support. But they may not necessarily be or- 
thonormal to each other. Examples of these basis func- 
tions are wavelet basis functions^ ad aptive local basis 
functionj^, Gaussian type orbit al^^^ESI ^nd local atomic 
orbit als^^Hi^. In numerical examples presented in sec- 
we use a set of nonorthogonal local atomic or- 
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tion 
bitals. 

Substituting 
problem 



|8| into ([T]) yields a generalized eigenvalue 



HC = SCE, 

where C is an A" x A" matrix with Cij 
entry, S is a diagonal matrix with Si 



(9) 

being its (i, j)th 
on its diagonal. 



Sij = {(pi\(pj)^ and Hij = {(pi\H\(pj). For orthogonal ba- 
sis functions, the overlap matrix S is an identity matrix. 
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and Eq. ([9| reduces to a standard eigenvalue problem. 
When local atomic orbitals are used as the basis, S is 
generally not an identity matrix, but both H and S are 
sparse. 

Without loss of generality, we assume the basis func- 
tions and the Kohn-Sham orbitals to be real in the fol- 
lowing discussion. Let ^ = [V^i,--- , V^at] and ^ = 
[(/)!,•• • , ^at], Then Eq. (|8| can be written in a compact 
form 

^ = ^C. (10) 

Consequently, the single-particle density matrix ([5| be- 
comes 



(11) 



B. 



Pole expansion and selected inversion for 
nonorthogonal basis functions 



The most straightforward way to evaluate j{x^x^) is to 
follow the right hand side of (11), which requires solving 
the generalized eigenvalue problem (|9|. The computa- 
tional complexity of this approach is 0{N^). This ap- 
proach becomes prohibitively expensive when the num- 
ber of electrons or atoms in the system increases. 

An alternative way to evaluate 7(x, x'), which circum- 
vents the cubic scaling of the diagonalization process, 
is to approximate 7(x,x') by a Fermi operator expan- 
sion (FOE) method^^. In an FOE scheme, the func- 
tion //^(S — /i) is approximated by a linear combination 
of a number of simpler functions, each of which can be 
evaluated directly without diagonalizing the matrix pen- 
cil {H^S). A variety of FOE schemes have been devel- 
oped. They include polynomial expansioiP^, rational ex- 
pansioiP^, and a hybrid scheme in which both polyno- 
mials and rational functions are used^^^. In all these 
schemes, the number of simple functions used in the ex- 
pansion is asymptotically determined by PAE^ where 
AE = max^^L ki ~ is the spectrum width for the dis- 
crete problem. 

While most of the FOE schemes require as many as 
0{l3AE) or 0{^/ j3AE) terms of simple functions, the re- 
cently developed pole expansion^ is particularly promis- 
ing since it requires only 0{\ogpAE) terms of simple 
rational functions. The pole expansion takes the form 



1=1 



{zi + /i) ' 



(12) 



where zi^uj^ G C are complex shifts and weights respec- 
tively. We will refer to {zi} as poles in the following 
discussions. The construction of the pole expansion is 
based on the observation that the non-analytic part of 
the Fermi-Dirac function lies only on the imaginary axis 



within 



u 



-zoo, - 



A dumbbell-shaped 



Cauchy contour (see Fig. [T]) is carefully chosen and dis- 
cretized to circle the eigenvalues {si} on the real axis, 
while avoiding the intersection with the non-analytic re- 
gion. The pole expansion does not require a band gap 
between the occupied and unoccupied states. Therefore, 
it is applicable to both insulating and metallic systems. 
Furthermore, the construction of the pole expansion re- 
lies only on the analytical structure of the Fermi-Dirac 
function rather than its detailed shape. This is a key 
property that is crucial for constructing pole expansions 
for other functions, including the reduced free energy 
density matrix and the reduced energy density matrix 
which are discussed in section |IIC| for the purpose of 



computing Helmholtz free energy and atomic forces. 
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FIG. 1: (color online) A schematic view of the 
placement of poles used in a pole expansion 
approximation of f(3{z). The thick black line on the real 
axis indicates the range of — /i, and the thin blue line 
on the imaginary axis indicates the non-analytic part of 
f(3{z). The yellow dumbbell shaped contour is chosen to 
exclude the non-analytic part of the complex plane. 
Each block dot on the contour corresponds to a pole 
used in the pole expansion approximation. 

Following the derivation in the appendix, we can use 



( 12 ) to approximate the single particle density matrix by 



= $(x)7$^(xO. 



-Til ^^ix') 

^^H-{zi^^)Sj ^ ^ (13) 



In the above expression, 7 is an x A/" matrix. It is often 
referred to as the reduced single particle density matrix. 
Using Eq. (13), we can evaluate the electron density in 



the real space as the diagonal elements of 7, i.e., 

p{x) ^ ^{x)^^^{x) = ^jijipj{x)ipi{x). (14) 

We assume that each basis function (pi{x) is compactly 
supported in the real space. In order to evaluate p{x) for 
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any particular we only need jij such that (pj{x)(pi{x) ^ 
0, or Sij ^ 0. This set of jij's is a subset of {jij \Hij ^ 0}. 
To obtain these selected elements^ we need to compute 
the corresponding elements of {H — {zi + jj.)S)~^ for all 

The recently developed selected inversion methocPMHl 
provides an efficient way of computing the selected ele- 
ments of an inverse matrix. For a symmetric matrix of 
the form A = H — zS, the selected inversion algorithm 
first constructs an LDL^ factorization of A, where L is 
a block lower diagonal matrix called the Cholesky fac- 
tor, and D is di block diagonal matrix. In the second 
step, the selected inversion algorithm computes all the 
elements A~j^ such that Lij / 0. Since Lij ^ im- 
plies that Hij ^ 0, all the selected elements of re- 



quired in (14) are computed. As a result, the computa- 



tional scaling of the selected inversion algorithm is only 
proportional to the number of nonzero elements in the 
Cholesky factor L. In particular, the selected inversion 
algorithm has a complexity of 0{N) for quasi-lD sys- 
tems, 0(Ari-^) quasi-2D systems, and 0{N'^) for 3D 
bulk systems. The selected inversion algorithm achieves 
universal improvement over the diagonalization method 
for systems of all dimensions. It should be noted that 
selected inversion algorithm is an exact method for com- 
puting selected elements of A~'^ if exact arithmetic is to 
be employed, and in practice the only source of error is 
the roundoff error. Particularly, the selected inversion 
algorithm does not rely on any localization property of 
though combined with localization property for in- 
sulating systems the computational cost can be further 
reduced which will be studied in future work. 



C. Total energy, Helmholtz free energy and atomic 
force evaluation 



where 7 is the reduced density matrix defined in ([5|. 
Note that in this expression, the first term depends on 
the trace of the product of 7 and H. The computation of 
this term requires only the (i, j)th entry of 7 for (i, j) sat- 
isfying Hij 7^ 0. These entries are already available from 
the charge density calculation, thus using them for to- 
tal energy evaluation does not introduce additional com- 
plexity. All other terms in the total energy expression 
depends on the electron density which we already 

know how to compute b y the PEpSI technique. Here we 
assume LDA^^ or GGA^^^^ exchange-correlation func- 
tional is used for the Kohn-Sham total energy expression. 

For metallic systems, the Helmholtz free energy J^tot 
is the quantity of interest because at finite temperature 
J^tot takes into account both energy and entrop}/^^. The 
Helmholtz free energy can be written as 



tot — 



2/3-^Trln(l + exp(/3(/i-S))) 
p{x)p{y) 



III 
I 



\x-y\ 



■dx dy^E^cip] 



(17) 



{x)p{x) dx. 



It is straightforward to show that as /3 ^ 00, J^tot ^ ^tot- 
Again, in Eq. ( pT] ), only the first term requires special 
treatment. Note that the function 

f^{e = -2/^-1 ln(l + exp(/?(^ - s))) (18) 

is analytic everywhere in the complex plane, ex- 
cept for segments of the imaginary axis within 



u 



—zoo, - 



ITT 

' /3 



In this sense, shares the 

same analytic structure as that of the Fermi-Dirac func- 
tion The pole expansion technique can be applied 
with the same choice of poles {zi} but different weights, 
denoted by {oof}, i.e. 



In addition to reducing the computational complexity 
of charge density calculation in each SCF iteration, the 
PEpSI technique can also be used to compute the total 
energy, the Helmholtz free energy as well as the atomic 
forces efficiently without diagonalizing the Kohn-Sham 
Hamiltonian. 

When Kohn-Sham orbit als are available, the to- 
tal energy associated with an insulating system can be 
evaluated as 



+ £^xc[p]- / V^c[p]{x)p{x)dx. 



dx dy 



(15) 



An alternative expression for E^ot is 

1 f f p(x)Ky) 



\x - y\ 
[p] - J V^c[p]{x)p{x)dx, 



(16) 



1 = 1 



{^1 + P) ' 



(19) 



Following the derivation in the appendix, we can rewrite 
the Helmholtz free energy as 



III 



^tot=Tr[7^5]+/i7V, 

+ E^c[p] - J V^c[p]p{x) dx, 



p{x)p{y) 



\x-y\ 



dx dy 



(20) 



where the reduced free energy density matrix 7-^ is given 

by 



1=1 



H-{zi+n)S' 



(21) 



Again, the selected elements of [H — {zi^ p)S]~'^ required 
for evaluation the first terms of ( [2Q| ) are already available 
from the charge density calculation. No additional com- 
putation is required to obtain these elements. 
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To perform geometric optimization or ab initio molec- 
ular dynamics, we need to compute atomic forces associ- 
ated with different atoms. Atomic force is the derivative 
of the Helmholtz free energy with respect to the position 
of an atom. Following the derivation in the appendix, 
we can express the atomic force associated with the I-th 
atom as 



Fi = 



dT 



= -Tr 



dH 



Tr 



dS 



dRi 



(22) 



where 7^ is the reduced energy density matrix defined 

by 



7^ = CEUiE - 
It is clear that the function 



(23) 



(24) 



shares the same analytic structure as that of the Fermi- 
Dirac function //3. Thus, the reduced energy density ma- 
trix can be approximated by the same pole expansion 
used to approximate the reduced density matrix (13). In 



particular, there is no difference in the choice of poles 
zi. But the weights of the expansion, which we denote 
by cjp, for the reduced energy density matrix approxi- 
mation, are different. To be specific, the reduced energy 
density matrix can be written as 



7 



1=1 



1=1 



(25) 

Again the selected elements of 7^ required in ( [22| ) can 
be easily computed from the selected elements of [H — 
{zi + which are available from the charge density 

calculation. 



D. Chemical potential update 

The true chemical potential /i required in the pole ex- 
pansions ([l3|, (20) and (25) is not known a priori. It 
must be solved iteratively as part of the solution to (|6| 
and For a fixed Hamiltonian H associated with a 



fixed charge density, it is easy to show that the left hand 
side (IgI), which can be expressed as. 



N{fi) = Tr[7] = Tt[j^^^] = Tt[jS] 



(26) 



is a monotonic function with respect ja. Hence the root 
of ([6| is unique. It can be obtained by either the Newton 
Raphson or the bisection method. 

In an SCF iteration, p and /i are often updated in 
an alternating fashion. When the Kohn-Sham quasi- 
particle energies associated with a fixed charge density 
are available, both A/'(/i) and its derivative can be easily 
evaluated in the Newton's method. However, if 7 is ap- 
proximated via a pole expansion ( 13 ), a new expansion is 



needed whenever /i is updated. Furthermore, the deriva- 
tive of N{fi) can be approximated by finite difference. In 
practice, one or two Newton's iterations are sufficient to 
produce a reasonably accurate /i after the first SCF iter- 
ation. When /i^ is sufficiently close to the true chemical 
potential, the derivative of N{fi^) can be approximated 
by 



k-l 



(27) 



Thus each Newton's iteration requires only one more se- 
lected inversion calculation. This type of iterative strat- 
egy for updating the chemical potential has also been 
discussed in literaturd^^. 



III. NUMERICAL RESULTS 

In this section, we report the performance gain we 
achieved by applying the PEpSI technique to an exist- 
ing electronic structure calculation code that uses local 
atomic orbital expansion to discretize the Kohn-Sham 
equations. 

The test problems we used are two types of nanotubes. 
One is a boron nitride nanotube (BNNT) with chirality 
(8,0), which is an insulating system shown in Figure [2j 
The other is a carbon nanotube (CNT) with chirality 
(8,8) shown in Figure |3j which is a metallic system. Ac- 



cording to the formula d 



\/3a 



mn + m^, where 



a is the bond length and (n, m) is the chirality of nan- 
otubes^^, the diameter for BNNT (8,0) is 12.09 Bohr and 
for CNT (8,8) is 20.50 Bohr. The longitudinal length of 
BNNT (8,0) with 256 atoms is roughly the same as CNT 
(8,8) with 512 atoms. 

We performed our calculation at F point only. Be- 
cause Brillouin zone sampling can be trivially paral- 
lelized, adding more /c-points will not affect the perfor- 
mance of our calculation. 




FIG. 2: (color online) Boron nitride nanotube (8,0) with 
256 atoms. The boron atoms are labeled as pink (light 
gray) balls while the nitrogen atoms are labeled as blue 
(dark gray) balls. The bond length between a pair of 
adjacent boron and nitride atoms is 1.45 Angstrom. 

Our computational experiments were performed on the 
Hopper system at the National Energy Research Sci- 
entific Computing (NERSC) center. The performance 
results reported below were obtained from running the 
existing and modified codes on a single core of Hopper 
which is part of a node that consists of two twelve-core 
AMD 'MagnyCours' 2.1-GHz processors. Each Hopper 
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ram? 




^^^^^^^^^^^^^^^^^^^^^^ 




FIG. 3: (color online) Carbon nanotube (8,8) with 512 
atoms. The carbon atoms are labeled as gray balls. The 
bond length between a pair of adjacent carbon atoms is 
1.42 Angstrom. 



node has 32 gigabytes (GB) DDR3 1333-MHz memory. 
Each core processor has 64 kilobytes (KB) LI cache and 
512KB L2 cache. It also has access to a 6 megabytes 
(MB) of L3 cache shared among 6 cores. 

Although the existing code has been parallelized using 
MPI and ScaLAPACK, the parallelization of selected in- 
version is still work in progress. Hence, the performance 
study reported here is limited to single-processor runs. 
However, we expect that the new approach of using the 
PEpSI technique to compute the charge density, total en- 
ergy, Helmholtz free energy and force will have a more 
favorable parallel scalability compared to diagonalizing 
the Kohn-Sham Hamiltonian by ScaLAPACK because it 
can take advantage of an additional level of parallelism 
introduced by the pole expansion. Due to the availability 
of such parallelism, the cost of the computational time of 
PEpSI is reported as the wall clock time for evaluating 
the selected elements of one single pole. 

In addition to comparing the performance of the ex- 
isting and new approaches in terms of wall clock time, 
we will also report the accuracy of our calculation and 
memory usage. 



A. Atomic Orbitals and the Sparsity of H and S 

The electronic structure calculation code we used for 
the performance study is based on a local atomic orbital 
expansion scheme^^ We will refer to this scheme as 
the CGH scheme below. In the CGH scheme, an atomic 
orbital (/>/2(r) is expressed as the product of a radial 
wave function f^^i{r) and a spherical harmonic F/^(f), 
where /i = and represent the 

atom type, the index of an atom, the multiplicity of the 
radial functions, the angular momentum and the mag- 
netic quantum number respectively. The radial function 
fi^,i{r) is constructed as a linear combination of spherical 
Bessel functions within a cutoff radius Tc, i.e.. 







(28) 



where ji{qr) is a spherical Bessel function with q chosen 
to satisfy ji{qrc)=0^ and the coefficients c^qji{qr) are cho- 
sen to minimize a "spillage factor" associated with a 



reference system that consists of a set of (4 or 5) dimers. 
We refer readers to Ref. [15] and [16] for the details on the 
construction of the CGH local atomic orbitals. 

The cutoff radius Vc determines the sparsity of the re- 
duced Kohn-Sham Hamiltonian H and the overlap ma- 
trix S. The smaller the radius, the sparser H and S are. 
The cutoff radius for the atomic orbitals is set to 8.0 Bohr 
for B and N atoms in BNNT, and 6.0 Bohr for C atoms 
in CNT, respectively. 

Another parameter that affects the dimension of H and 
S is the multiplicity ^ of the radial function f^^r). The 
multiplicity determines the number of basis functions per 
atom. A higher multiplicity results in larger number of 
basis functions per atom, which in turn results in more 
rows and columns in H and S. In our experiments, we 
used both single-^ (SZ) orbitals and double-^ plus polar 
orbitals (DZP). The number of local atomic orbitals is 4 
for SZ and 13 for DZP. 

We measure the sparsity by the percentage of the 
nonzero elements in the matrix H denoted by 



Hnnz% — 



nnz{H) 



X 100. 



(29) 



Here nnz(i^) is the number of nonzero elements of H 
and N{H) is the dimension of H respectively. Since the 
computational cost of the selected inversion method is 
determined by the sparsity of L + for the Cholesky 
factor L of H — zS^ we will also report the percentage of 
the nonzero elements in the matrix L + (denoted by 
Lnnz%) below. To reduce the amount of non-zero fill-in 
of L, we use the nested dissection (ND) techniqu^^ to 
reorder the sparse matrix H — zS before it is factored. 
Fig. |4] (a) depicts the sparsity pattern of the H matrix 
associated with a 5120-atom BNNT (8,0) obtained from 
SZ atomic orbitals after it is reordered by ND. The spar- 
sity pattern of L + for the corresponding Cholesky 
factor L of the same problem is shown in Fig. [4] (b). 



^ BNNT 5120 atoms, H 



BNNT 5120 atoms, L 




FIG. 4: (color online) The sparsity pattern of H (a) and 
L + (b) for an 5120-atom BNNT (8,0) with SZ 
orbitals. Nested dissection reordering is used. 

Table [l| shows the sparsity of Hamiltonian matrices as- 
sociated with BNNT (8,0) and CNT (8,8) systems that 
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consist of 64 to 10240 atoms. The Hamiltonians for these 
systems are constructed from SZ atomic orbit als. We re- 
port both the i^nnz% and Lnnz% values. We can clearly 
see from this table that and consequently L, are quite 
dense when the number of atoms in the nanotubes is rel- 
atively small (less than 512). This is due to fact that 
a large percentage of atoms in these small systems are 
within the Vc distance from each other. When the sys- 
tem size becomes larger (with more than 512 atoms), 
both i^nnz% and Lnnz% are inversely proportional to the 
system size. This is because for quasi- ID systems, the nu- 
merator in Eq. ( [29| ) scales linearly with respect to N{H) 
for large N{H). Hence, the resulting matrices become in- 
creasingly sparse, thereby making the selected inversion 
method more favorable. 



B. Performance comparison between 
diagonalization and selected inversion 

We now compare the efficiency of selected inversion 
with that of diagonalization for computing the charge 
density in a single SCF iteration. In the existing code, the 
diagonalization of the matrix pencil (i^, S) is performed 
by using the LAPACK subroutine dsygv when the code 
is run on a single processor. The selected inversion is 
performed by the Sellnv software^. 

We use BNNT(8,0) and CNT(8,8) nanotubes of differ- 
ent lengths to study the scalability of the computation 
with respect to the number of atoms in the nanotube. 
The number of atoms in these tubes ranges from 64 to 
10240. 

Fig. [5] shows how the wall clock time used by Sellnv 
compares with that used by dsygv for BNNT(8,0) of dif- 
ferent sizes. When SZ atomic orbitals are used, Sellnv 
takes almost the same amount of time as that used by 
dsygv for a BNNT with 64 atoms. When the number 
of atoms is larger than 64, Sellnv is more efficient than 
dsygv. The cubic scaling of dsygv with respect to the 
number of atoms can be clearly seen from the slope of the 
blue loglog curve, which is approximately 3. The linear 
scaling of Sellnv, which is indicated by the slope of the 
red curve, is evident when the number of atoms exceeds 
200. For systems with less than 200 atoms, the wall clock 
time consumed by Sellnv scales cubically with respect 
to the number of atoms also. This is due to the fact that 
the H and S matrices associated with these small systems 
are nearly dense. Similar observations can be made when 
the DZP atomic orbitals are used. In this case, Sellnv 
is already more efficient than dsygv when the number of 
atoms is only 64. The linear scaling of Sellnv can be 
observed when the number of atoms exceeds 128. 

Fig. [6] shows the timing comparison between Sellnv 
and dsygv for CNT (8,8) of different sizes. Because the 
cutoff radius for the carbon atom is chosen to be 6.0, 
which is smaller than that associated with the boron and 
nitrogen atoms, the H and S matrices associated with 
CNT (8,0) are sparser even when the number of atoms in 



the tube is relatively small. This explains why Sellnv is 
already more efficient than dsygv already for a CNT with 
64 atoms regardless whether SZ or DZP atomic orbitals 
are used. However, the linear scaling of Sellnv timing 
with respect to the number of atoms does not show up 
until the number of atoms reaches 500. The increase in 
the crossover point is due to the fact that the sparsity of 
H is asymptotically determined by the number of atoms 
per unit length of the nanotube. Because the CNT (8,0) 
we use in our experiment has a large diameter, there are 
more atoms along the radial direction per unit length 
in CNT than that in BNNT. Consequently, it takes al- 
most twice as many as atoms for CNT to reach the same 
length along the longitudinal direction when compared 
to BNNT, as we can see from Fig. |2] and Fig. |3] 

We should note here that it is possible to combine the 
PEpSI technique with a SZ atomic orbital based Kohn- 
Sham DFT solver to perform electron structure calcula- 
tion on quasi-lD systems with more than 10,000 atoms. 
On the Hopper machine, the wall clock time used to per- 
form a single selected inversion of the H — zS matrix 
associated with a 5,120-atom BNNT(8,0) is 26.72 sec- 
onds. When the number of atoms increases to 10240, the 
wall clock time increases to 50.07 seconds. Similar perfor- 
mance is observed for CNT(8,8). It takes 47.59 seconds 
to perform a selected inversion for a 5120-atom CNT(8,8) 
tube, and 97.16 seconds for a 10240-atom tube. 



C. Memory usage 

We should also remark that the memory requirement 
for Sellnv increases linearly with respect to the number 
of atoms when the nanotube reaches a certain size. For 
a nanotube that consists of 10240 atoms, the amount of 
memory required to store L and the selected elements 
of [H - {zi + fi)S]-^ is 0.66 GB and 0.93 GB respec- 
tively. The relatively low memory requirement of Sellnv 
for quasi- ID system suggests that the method may even 
be applicable to quasi- ID systems that contain more than 
100, 000 atoms on a single processor. 



D. Accuracy 

When selected inversion can be computed to high accu- 
racy, which is often the case in practice, the only source of 
error introduced by the PEpSI technique comes from the 
limited number of terms in the pole expansion (13). The 



number of poles needed in (13) to achieve a desired level 
of accuracy in total energy(or Helmholtz free energy) 
and force is largely determined by the inverse tempera- 
ture /3 = l/{kBT) used in ([3| and the spectrum width 
/S.E. Here we show that at room temperature T = 300i^, 
the number of poles required to provide an accurate pole 
expansion approximation is modest even for a metallic 
system such as CNT(8,8). Table [ll| shows that when di- 
agonalization is replaced by PEpSI for a single F-point 
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# Atoms 


64 


128 


256 


512 


1024 


1920 


5120 


10240 


BNNT (8,0) 


Hnnz% 
-t^nnz% 


100.00 
100.00 


85.54 
99.48 


42.77 
77.94 


21.43 
46.13 


11.69 
25.07 


5.70 
13.70 


2.13 
5.26 


1.06 
2.64 


CNT (8,8) 


Hnnz% 


40.63 
69.92 


38.67 
68.45 


19.53 
68.70 


9.77 
54.38 


4.88 
31.75 


2.60 
17.54 


0.97 
7.42 


0.49 
3.79 



TABLE I: The percentage of nonzero elements i^nnz% and Lnnz% for BNNT (8,0) and CNT (8,8) of various sizes. 



calculation, the errors in total energy and force decrease 
as the number of poles in (13) increases. When the num- 
ber of poles reaches 80, the difference between the final 
total energies produced by the existing code and the mod- 
ified code (which replaces diagonalization with PEpSI) is 
3.6 X 10~^ eV. The difference in the mean absolute error 
(MAE) is 2 X 10~^ eV/Angstrom, which is quite small 
for all practical purposes. 



# Poles 


^PEpSI — Eref (eV) 


MAE Force (eV/ Angstrom) 


20 


5.868351108 


0.400431 


40 


0.007370583 


0.001142 


60 


0.000110382 


0.000026 


80 


0.000000360 


0.000002 



TABLE IL The difference between the total energy and 
atomic force produced by the existing electronic 
structure code and modified version in which 
diagonalization is replaced by PEpSL The difference in 
atomic force is measured in terms of the mean absolute 
error (MAE). 
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FIG. 6: (color online) Comparisons of the wallclock 
time by selected inversion (at one pole) required for 
PEpSI and by the LAPACK dsygv used to diagonalize 
a Kohn-Sham Hamiltonian associated with CNT(8,8). 
The Hamiltonians are constructed from SZ orbit als (4 
basis per atom) in (a) and DZP orbitals (13 basis per 
atom) in (b). 
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FIG. 5: (color online) Comparisons of the wallclock 
time used by selected inversion (at one pole) required 

for PEpSI and by the LAPACK dsygv used to 
diagonalize a Kohn-Sham Hamiltonian associated with 
BNNT (8,0). The Hamiltonians are constructed from 
SZ orbitals (4 basis per atom) in (a) and DZP orbitals 
(13 basis per atom) in (b). 



IV. CONCLUSION 

In this paper, we generalized the recently developed 
pole expansion and selected inversion technique (PEpSI) 
for solving finite dimensional Kohn-Sham equations ob- 
tained from an atomic orbital expansion. We gave ex- 
pressions for evaluating electron density, total energy, 
Helmholtz free energy and atomic forces without using 
eigenvalues and eigenvectors of a Kohn-Sham Hamilto- 
nian. These expressions are derived from an FOE ap- 
proximation to the Fermi-Dirac function using an ef- 
ficient and accurate pole expansion technique. They 
only use selected elements of the reduced density ma- 
trix, reduced energy density matrix and reduced free en- 
ergy density matrix. These selected elements can be ob- 
tained from computing selected elements of the inverse of 
a shifted Kohn-Sham Hamiltonian through the selected 
inversion technique. The complexity of selected inver- 
sion is 0{Ne) for quasi- ID systems such as nanorods, 
nanotubes and nanowires, O^nV^) for quasi-2D systems 
such as graphene and surfaces, and 0{N^) for 3D bulk 
systems. It compares favorably to the complexity of di- 
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agonalization, which is 0{N^). We reported the perfor- 
mance gain we can achieve by comparing the efficiency 
of PEpSI with that of diagonalization on two types of 
nanotubes. The linear scaling of PEpSI with respect to 
the number of atoms is clear when the number of atoms 
in these quasi- ID systems is larger than a few hundreds. 
Even when these nanotubes contain fewer than a hundred 
atoms, PEpSI still appears to outperform a diagonaliza- 
tion based DFT calculation. For quasi-2D and quasi- 
3D systems, we expect the crossover point over which 
PEpSI exhibits 0{n!^'^) and 0{N^) scaling to be much 
larger. However, based on the experiments presented 
here, PEpSI may still be more efficient than diagonal- 
ization (before the crossover point is reached) as long as 
the Cholesky factors of the shifted Kohn-Sham Hamilto- 
nian are not completely dense. 

The computational experiments we presented above 
were performed on a single processor. For quasi- ID sys- 
tems such as nanotubes, the use PEpSI allows us to 
tackle problems that contain as many as 10,000 atoms. 
This cannot be done by using a diagonalization based ap- 
proach. For quasi-2D and 3D systems, a parallel imple- 
mentation of the PEpSI, which we are currently working 
on, is required to solve problems with that many atoms. 
We will report the performance for these large-scale cal- 
culations in a future publication. 
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APPENDIX 



Derivation of Eq. (13): 



S is a diagonal matrix, and the pole expansion (12) 
can be applied to each component of E as 



where / is an x identity matrix. Using Eq. ( 11 ), the 
single particle density matrix can be written as 



7(x, x') ^ ^(x)CTm^ „ ^ C^^^{x') 



1=1 



(30) 



Since the generalized eigenvalue problem ([9| implies the 
identity 



C^HC = S, C^SC = /, 



(31) 



the single particle density matrix takes the form 
p 



(32) 



which is Eq. (13). 



Derivation of Eq. (|20|): 

The first term in t 



le Helmholtz free energy functional 



Tr[7-^S']. 



(33) 



The second equal sign in Eq. ( |33| ) defines the reduced free 
energy density matrix 7"^, which can be evaluated using 
the pole expansion (19) as 



1=1 
p 

3m 



•'I 



1=1 
p 



C-THC-^ - ziC-TC-'^ 

T 



(34) 



1=1 



which is Eq. ([20|. 

Derivation of Eq. (22): 



The atomic force is in general given by the derivative 
of the Helmholtz free energy with respect to the atomic 
positions. Using the representation of the Helmholtz free 
energy in Eq. (ITt]), and the fact that 



{fi)'{z) = fp{z), iVe = Tr[/^(S-M)], (35) 
it can be derived that 



d 



-Ft 



d 



dRr dRi 



(TV[/^(S - - fiN, 



-Tr 
-Tr 
-Tr 

Tr 
-Tr 

Tr 



(/^)'(S-A.) 



as 



dRi dRi 



ffi{=' - m) 



as 

dRr 



(iVe-T:[/^(S-M)]) 



Tr 



dC 
dRi 



OH 



Tr 



fp{~-ti)^HC 



dRr 



dC 



(36) 



The second and the third terms in Eq. (36) come from 



the nonorthogonality of the basis functions and should 
be further simplified. We have 
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Tr 
:Tr 
:Tr 



Tr 



dC 
dRr 



C-')[C{C" HC)fp{E - fi)C" ]{C-' C-')C 



dRi 



Tr 



(37) 



Define the reduced energy density matrix as in Eq. (23), 
and Eq. (37) can be simplified as 



Tr 
:Tr 



j^S C 



dRi dRi 



^ ^ dRi 



-Tr 



dS 



dRr 



Combining Eq. (38) and Eq. (36), we have 



Fi 



dRi 



Tr 



dH 



-Tr 



dS 



7 



dRi 



(38) which proves Eq. p2). 
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